SUMMARY

We extend the approach of Walker (2003, 2004) to the case of misspecified models.

A sufficient condition for establishing rates of convergence is given based on a key 
identity involving martingales, which does not require construction of tests. We also 
show roughly that the result obtained by using tests can also be obtained by our 
approach, which demonstrates the potential wider applicability of this method. 
1 INTRODUCTION



Bayesian inference distinguishes itself from the frequentist school by its explicit quan- 
tification of uncertainty of the parameter with prior specification. The classical ap- 
proach uses subjective priors elicited from field experts and domain knowledge. The 
modern Bayesian school has instead shifted attention more towards the construction
of priors using formal rules in the hope of dealing with arbitrariness of the prior.

Asymptotics for infinite-dimensional Bayesian statistics has been receiving a lot
of attention recently. In these studies, Bayesian inference is approached from a
frequentist point of view, that is, we assume there is a true underlying probability 
distribution that generates the data. Naturally one desired property is that as more 
and more observations are made from the underlying generating mechanism, we will 
obtain an accurate estimate of the true distribution. While traditional Bayesians do not
believe in such an assumption, it is shown by Blackwell & Dubins (1962) that this 
property is the same as intersubjective agreement, which means two Bayesians will 
eventually come to roughly the same conclusion after seeing enough data. 

The posterior distribution typically behaves well under regular parametric models. 
Doob showed that consistency is achieved under almost no assumptions on the model, 
except for a zero measure set under the prior, although in topological terms this set 
can be large. For infinite-dimensional models, however, the matter is more subtle. 
Strange behavior can be observed under some priors, as documented in Diaconis &
Freedman (1986). Given the prior $\Pi$ on the set $\mathcal{P}$ of probability distributions, the
posterior is a random measure:

$$\Pi(B \mid X_1, \dots, X_n) = \frac{\int_B \prod_{i=1}^n p(X_i)\, d\Pi(p)}{\int_{\mathcal{P}} \prod_{i=1}^n p(X_i)\, d\Pi(p)}.$$

For ease of notation, we will omit the conditioning and only write $\Pi^n(B)$ for the
posterior distribution. We say that the posterior is consistent if

$$\Pi^n(P \in \mathcal{P} : d(P, P_0) > \epsilon) \to 0 \quad \text{in } P_0^n \text{ probability},$$

where $P_0$ is the true distribution and $d$ is some suitable distance function between
probability measures.
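
To make the definition concrete, the following minimal sketch evaluates the posterior mass of a set when the prior is supported on finitely many Bernoulli models. The candidate parameters, the uniform prior, the fixed data sequence and the helper name `posterior_mass` are illustrative assumptions and are not part of the development in this paper.

```python
import math

def posterior_mass(thetas, prior, data, in_set):
    """Posterior mass of {theta : in_set(theta)} for a finite Bernoulli model.

    Finite-prior analogue of
    Pi(B | X_1..X_n) = int_B prod_i p(X_i) dPi(p) / int prod_i p(X_i) dPi(p).
    """
    # Accumulate log prior + log likelihood for numerical stability.
    log_w = []
    for theta, w in zip(thetas, prior):
        ll = sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)
        log_w.append(math.log(w) + ll)
    m = max(log_w)
    weights = [math.exp(v - m) for v in log_w]
    total = sum(weights)
    return sum(w for theta, w in zip(thetas, weights) if in_set(theta)) / total

# Hypothetical setup: three candidate models with a uniform prior.
thetas = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]
data = [1, 0, 1, 1, 0, 1, 1, 1] * 25   # fixed stand-in for a sample of size 200
theta0, eps = 0.8, 0.1                 # reference parameter and radius

# Posterior mass of {theta : |theta - theta0| > eps}.
mass_far = posterior_mass(thetas, prior, data, lambda t: abs(t - theta0) > eps)
print(mass_far)
```

With these inputs the posterior mass more than `eps` away from 0.8 is negligible, which is the finite-model analogue of the consistency statement above.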

To study rates of convergence, let $\epsilon_n$ be a sequence decreasing to zero. We say the
rate is at least $\epsilon_n$ if, for a sufficiently large constant $M$,

$$\Pi^n(P : d(P, P_0) > M\epsilon_n) \to 0 \quad \text{in } P_0^n \text{ probability}.$$

We can also have a slightly weaker definition of rates of convergence by replacing $M$
with a sequence $M_n$ and requiring that the above posterior mass converge to zero for
any sequence $M_n$ that diverges to infinity.

On the positive side, Schwartz (1965) shows consistency for specific distributions
by constructing a sequence of tests of the true distribution against distributions some
positive distance away. The tests can trivially be constructed for weak neighborhoods.
The construction of similar tests for stronger topologies (typically measured in Hellinger
distance, for example) is not so straightforward and requires extra work. Barron et
al. (1999) give sufficient conditions that guarantee consistency of infinite-dimensional
models by bounding the likelihood ratio under a bracketing entropy constraint on sieves.
Shen & Wasserman (2001) studied rates of convergence. A related approach based on
constructing a sequence of tests appeared in Ghosal et al. (2000).

The conditions imposed in the works above are sufficient but not necessary. It is important
to see to what extent these conditions can be relaxed. A parallel line of work
by Stephen Walker and his collaborators proves
consistency and rates of convergence under slightly less stringent conditions. These
results are established by constructing a certain supermartingale; consistency and
rates of convergence are then shown by focusing on the distance of certain predictive
distributions to the true one. This approach does not require the construction of a sequence
of tests or sieves. It is shown that this new approach can lead to somewhat weaker
sufficient conditions or faster rates.

In Kleijn & van der Vaart (2006), the authors consider the situation where one
cannot expect to achieve consistency since the prior is misspecified. In this case, it
is not surprising that the posterior will converge to the distribution in the support
of the prior that is closest to the true distribution measured in Kullback-Leibler
divergence. Instead of using the usual entropy number or its local version, they used
a new concept called covering number for testing under misspecification and studied
rates by constructing a sequence of tests between the true distribution $P_0$ and another
measure that is not necessarily a probability distribution. The new entropy number
can be reduced to the usual entropy in the well-specified case. In this paper, we also study
the posterior distribution under misspecification, but without constructing
a sequence of tests.

The goal of this paper is twofold. First, we show that the approach in Walker (2003,
2004) can be extended to the situation of a misspecified prior rather straightforwardly,
by introducing an $\alpha$-entropy condition that is slightly stronger than that of Kleijn &
van der Vaart (2006). Second, we show that, using a more refined analysis, a result
similar to Theorem 2.2 in Kleijn & van der Vaart (2006) can be recovered. In particular,
this shows that in the well-specified case, this approach is indeed more general
than the approach of constructing a sequence of tests.

In §2, we introduce the necessary notation and concepts and present the martingale
construction due to Walker (2003). In §3, we prove the main result and show that
this approach is in some sense more general than the one presented in Kleijn & van der
Vaart (2006). We end the paper with a discussion in §4.



2 PRELIMINARIES 

Let $\{X_1, X_2, \dots\}$ be independent samples generated from the distribution $P_0$, with the
corresponding lower case letter $p_0$ denoting the density with respect to some dominating
measure $\mu$. We are given a collection of distributions $\mathcal{P}$, and a prior $\Pi$ on it with
$\Pi(\mathcal{P}) = 1$. For simplicity, we assume that there exists a unique distribution $P^* \in \mathcal{P}$
that achieves the minimum value of the Kullback-Leibler divergence to the true distribution,
that is,

$$E_0\left(\log \frac{p_0}{p^*}\right) \le E_0\left(\log \frac{p_0}{p}\right) \quad \text{for all } P \in \mathcal{P},$$

where $E_0$ denotes the expectation under the true distribution $P_0$.
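
As an illustration of the Kullback-Leibler minimizer $P^*$, the sketch below locates it by grid search in a small misspecified model. The true distribution `p0`, the Binomial(2, theta) family and the grid are hypothetical choices made only for this example.

```python
import math

def binom2_pmf(theta):
    """Binomial(2, theta) probabilities on {0, 1, 2}."""
    return [(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2]

def kl(p0, p):
    """Kullback-Leibler divergence E_0 log(p0 / p) for discrete distributions."""
    return sum(a * math.log(a / b) for a, b in zip(p0, p) if a > 0)

# Hypothetical true distribution; it is not Binomial(2, theta) for any theta.
p0 = [0.5, 0.1, 0.4]

# Grid search for the model element closest to p0 in Kullback-Leibler divergence.
grid = [k / 1000 for k in range(1, 1000)]
theta_star = min(grid, key=lambda t: kl(p0, binom2_pmf(t)))

mean = sum(x * a for x, a in enumerate(p0))
print(theta_star, mean / 2, kl(p0, binom2_pmf(theta_star)))
```

In this example the grid minimizer coincides with half the mean of the true distribution, as the first-order condition for the binomial likelihood suggests, and the minimal divergence is strictly positive because the model is misspecified.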

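Let $R_n(p) = \prod_{i=1}^n p(X_i)/p^*(X_i)$; then the posterior mass for a set $B$ is

$$\Pi^n(B) = \frac{\int_B R_n(p)\, d\Pi(P)}{\int_{\mathcal{P}} R_n(p)\, d\Pi(P)}. \qquad (1)$$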

Following Kleijn & van der Vaart (2006), for $\epsilon > 0$, $0 < \alpha < 1$ and some suitable
semi-metric $d$ on $\mathcal{P}$, we define the $\alpha$-covering of the set $A = \{P \in \mathcal{P} : d(P, P^*) > \epsilon\}$
as a collection of convex sets $\{A_1, A_2, \dots\}$ that covers $A$ with the additional property
that for any $j$,

$$\inf_{P \in A_j} -\log E_0\left(\frac{p}{p^*}\right)^{\alpha} \ge \frac{\epsilon^2}{4}, \qquad (2)$$

and denote by $N_t(\epsilon, \alpha, A)$ the minimum integer $N$ such that there exists $\{A_1, \dots, A_N\}$
that forms such a cover, if it is finite.

This condition appears to be stronger than the concept of covering for testing under
misspecification introduced by Kleijn & van der Vaart (2006), which only requires
that

$$\inf_{P \in A_j} \sup_{0 < \alpha < 1} -\log E_0\left(\frac{p}{p^*}\right)^{\alpha} \ge \frac{\epsilon^2}{4}. \qquad (3)$$

In all the examples they gave in their paper, though, we can find a certain value of
$\alpha$, depending only on the specification of the model, that satisfies our condition. As
shown in Kleijn & van der Vaart (2006), when $\mathcal{P}$ is convex, we have
$d^2(P, P^*) \le -\log E_0(p/p^*)^{1/2}$, where $d$ is a generalized Hellinger distance defined by
$d^2(P_1, P_2) = \frac{1}{2} \int (p_1^{1/2} - p_2^{1/2})^2 \, (p_0/p^*) \, d\mu$, which reduces to the usual Hellinger distance in the
well-specified case. In this situation, the 1/2-covering for testing can be replaced by the
usual covering, as shown in Kleijn & van der Vaart (2006). In general, allowing $\alpha$ to
differ from 1/2 is required, since in the misspecified case we cannot guarantee
that $-\log E_0(p/p^*)^{1/2} > 0$, and we are obliged to choose some smaller $\alpha$ in order to
find the covering.
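
The need for a smaller $\alpha$ can be seen in a toy computation. The sketch below uses a hypothetical, non-convex two-element model on the sample space {0, 1}; the three distributions and the helper names are assumptions chosen only to illustrate the sign change of $-\log E_0(p/p^*)^{\alpha}$.

```python
import math

def affinity(p, p_star, p0, alpha):
    """E_0 (p / p*)^alpha for discrete densities on a common finite support."""
    return sum(w * (a / b) ** alpha for a, b, w in zip(p, p_star, p0))

def expected_loglik(p, p0):
    """E_0 log p; a larger value means a smaller Kullback-Leibler divergence to p0."""
    return sum(w * math.log(a) for a, w in zip(p, p0))

# Hypothetical example on {0, 1}; the model {p_star, p} is not convex.
p0 = [0.5, 0.5]        # true distribution, outside the model
p_star = [0.4, 0.6]    # model element closest to p0 in Kullback-Leibler divergence
p = [0.65, 0.35]       # competing model element

# p_star is indeed the Kullback-Leibler minimizer within the two-element model.
assert expected_loglik(p_star, p0) > expected_loglik(p, p0)

# Evaluate -log E_0 (p / p*)^alpha at alpha = 1/2 and at a smaller alpha.
neg_log = {alpha: -math.log(affinity(p, p_star, p0, alpha)) for alpha in (0.5, 0.05)}
print(neg_log)
```

Here the quantity is negative at $\alpha = 1/2$ but positive at $\alpha = 0.05$, so a covering condition of the form (2) can only hold for this pair with a small enough $\alpha$.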

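The predictive density constrained to a general set $A$ is defined as

$$p_{nA}(x) = \int_A p(x)\, d\Pi_A^n(P),$$

where $\Pi_A^n(P) = 1\{P \in A\}\,\Pi^n(P)/\Pi^n(A)$ is the posterior measure conditioned on $A$.
The key identity noted by Walker (2003) is the following:

$$\int_A R_{n+1}(p)\, d\Pi(P) = \frac{p_{nA}(X_{n+1})}{p^*(X_{n+1})} \int_A R_n(p)\, d\Pi(P),$$

as can be verified easily. This in turn implies that

$$E_0\left[\left(\int_A R_{n+1}(p)\, d\Pi(P)\right)^{\alpha} \,\Big|\, X_1, \dots, X_n\right] = \left(\int_A R_n(p)\, d\Pi(P)\right)^{\alpha} E_0\left(\frac{p_{nA}}{p^*}\right)^{\alpha}, \qquad (4)$$

which means that $\left(\int_A R_n(p)\, d\Pi(P)\right)^{\alpha}$ is a supermartingale when $E_0(p_{nA}/p^*)^{\alpha} \le 1$.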
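
The key identity can be checked numerically when the prior restricted to $A$ has finitely many atoms. The following is a minimal sketch under that assumption; the Bernoulli model, the atoms, their prior masses, the value of `theta_star` and the data are all hypothetical.

```python
import math

def density(theta, x):
    """Bernoulli(theta) density at x in {0, 1}."""
    return theta if x == 1 else 1.0 - theta

def integral_R(thetas_A, prior_A, theta_star, data):
    """Finite-prior version of the integral of R_n(p) over A."""
    return sum(
        w * math.prod(density(t, x) / density(theta_star, x) for x in data)
        for t, w in zip(thetas_A, prior_A)
    )

def predictive_on_A(thetas_A, prior_A, data, x_new):
    """p_{nA}(x_new): mixture of p(x_new) under the posterior restricted to A."""
    weights = [w * math.prod(density(t, x) for x in data)
               for t, w in zip(thetas_A, prior_A)]
    total = sum(weights)
    return sum(w / total * density(t, x_new) for t, w in zip(thetas_A, weights))

thetas_A = [0.1, 0.2, 0.7]   # atoms of the prior inside A
prior_A = [0.2, 0.3, 0.1]    # their prior masses (need not sum to one)
theta_star = 0.5             # parameter of p*
data = [1, 0, 0, 1, 1]       # X_1, ..., X_n
x_new = 0                    # X_{n+1}

# Left and right sides of the identity for R_{n+1} versus R_n.
lhs = integral_R(thetas_A, prior_A, theta_star, data + [x_new])
rhs = (predictive_on_A(thetas_A, prior_A, data, x_new)
       / density(theta_star, x_new)) * integral_R(thetas_A, prior_A, theta_star, data)
print(lhs, rhs)
```

The two sides agree up to floating-point error, since the normalizing constant of the restricted posterior cancels against the integral of $R_n$.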



3 RATES OF CONVERGENCE 

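To study rates of convergence, for a sequence $\epsilon_n \to 0$, we let
$A_n = \{P \in \mathcal{P} : d(P, P^*) > M\epsilon_n\}$, and let $\{A_{n,j}\}$ be an $\alpha$-covering of $A_n$, i.e., $\{A_{n,j}\}$ are convex sets
that cover $A_n$ and

$$\inf_{P \in A_{n,j}} -\log E_0\left(\frac{p}{p^*}\right)^{\alpha} \ge \frac{M^2\epsilon_n^2}{4}.$$

Define

$$L_k^{n,j} = \int_{A_{n,j}} R_k(p)\, d\Pi(P) \qquad (5)$$

and

$$I_n = \int_{\mathcal{P}} R_n(p)\, d\Pi(P).$$

To obtain a lower bound for $I_n$, which is the denominator in (1), we also need a
condition on the prior mass of a Kullback-Leibler neighborhood of $P^*$, which is defined
as

$$B(\epsilon, P^*; P_0) = \left\{P \in \mathcal{P} : -E_0\left(\log \frac{p}{p^*}\right) \le \epsilon^2,\ E_0\left(\log \frac{p}{p^*}\right)^2 \le \epsilon^2\right\}.$$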

Theorem 1 Assume that $P^*$ is the unique minimizer in $\mathcal{P}$ of the Kullback-Leibler
divergence to the true distribution with $E_0(\log(p_0/p^*)) < \infty$. Let $\epsilon_n$ be a sequence such
that $\epsilon_n \to 0$ and $n\epsilon_n^2 \to \infty$, and let $A_n$, $A_{n,j}$ be defined as above. If the following conditions
hold:

1) $e^{Kn\epsilon_n^2} \ge \sum_j \Pi(A_{n,j})^{\alpha}$ for a sufficiently large constant $K$,

2) $\Pi(B(\epsilon_n, P^*; P_0)) \ge e^{-Ln\epsilon_n^2}$ for a sufficiently large constant $L$,

then $\Pi^n(P : d(P, P^*) > M\epsilon_n) \to 0$ in $P_0^n$ probability.

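Proof. First we observe that $P_{nA_{n,j}} \in A_{n,j}$ by the convexity of $A_{n,j}$. From the
definition of $\alpha$-covering, $\inf_{P \in A_{n,j}} -\log E_0(p/p^*)^{\alpha} \ge M^2\epsilon_n^2/4$, so the predictive density
satisfies

$$E_0\left(\frac{p_{nA_{n,j}}}{p^*}\right)^{\alpha} \le e^{-M^2\epsilon_n^2/4}.$$

Taking expectations in (4), with $A$ replaced by $A_{n,j}$, we get for every $k$

$$E_0\left(L_{k+1}^{n,j}\right)^{\alpha} \le e^{-M^2\epsilon_n^2/4}\, E_0\left(L_k^{n,j}\right)^{\alpha},$$

and hence, iterating from $L_0^{n,j} = \Pi(A_{n,j})$,

$$E_0\left(L_n^{n,j}\right)^{\alpha} \le e^{-nM^2\epsilon_n^2/4}\, \Pi(A_{n,j})^{\alpha}.$$

The posterior distribution can be bounded as follows:

$$\Pi^n(A_n) \le \sum_j \Pi^n(A_{n,j}) \le \sum_j \Pi^n(A_{n,j})^{\alpha} = \frac{\sum_j \left(L_n^{n,j}\right)^{\alpha}}{I_n^{\alpha}}.$$

Lemma 7.1 in Kleijn & van der Vaart (2006) shows that when $n\epsilon_n^2 \to \infty$, for every
$C > 0$, on a set $\Omega_n$ with probability converging to 1, we have $I_n \ge \Pi(B(\epsilon_n, P^*; P_0))\, e^{-n\epsilon_n^2(1+C)}$,
so we can write

$$E_0(\Pi^n(A_n)) = E_0(\Pi^n(A_n) 1_{\Omega_n}) + E_0(\Pi^n(A_n) 1_{\Omega_n^c}) \le \frac{\sum_j e^{-nM^2\epsilon_n^2/4}\, \Pi(A_{n,j})^{\alpha}}{\Pi(B(\epsilon_n, P^*; P_0))^{\alpha}\, e^{-\alpha n\epsilon_n^2(1+C)}} + P_0(\Omega_n^c),$$

which converges to zero by conditions 1) and 2) if $M$ is sufficiently large. □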



For a compact set of models $\mathcal{P}$, we can use the trivial bound

$$\sum_j \Pi(A_{n,j})^{\alpha} \le N_t(\epsilon_n, \alpha, A_n),$$

which gives the following result, similar to Theorem 2.1 in Kleijn & van der Vaart
(2006), although they used a local version of the entropy instead.

Theorem 2 If instead of condition 1) in Theorem 1 we assume $N_t(\epsilon_n, \alpha, A_n) \le e^{n\epsilon_n^2}$,
then for a sufficiently large constant $M$,

$$\Pi^n(P : d(P, P^*) > M\epsilon_n) \to 0 \quad \text{in probability}.$$

In order to get the optimal rate for parametric models, Kleijn & van der Vaart (2006)
used a more refined assumption. In place of condition 2) in Theorem 1 above, they
assumed

$$\frac{\Pi(P : J\epsilon_n < d(P, P^*) < 2J\epsilon_n)}{\Pi(B(\epsilon_n, P^*; P_0))} \le e^{n\epsilon_n^2 J^2/8} \qquad (6)$$

for all natural numbers $n$ and $J$. In order to recover this result, we need a more
careful analysis.

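First, we define $A_n^J = \{P \in \mathcal{P} : M_nJ\epsilon_n < d(P, P^*) < 2M_nJ\epsilon_n\}$, with $\alpha$-covering
$\{A_{n,j}^J\}$ defined similarly as before with the property $\inf_{P \in A_{n,j}^J} -\log E_0(p/p^*)^{\alpha} \ge M_n^2J^2\epsilon_n^2/4$.
Let $A^{\prime J}_{n,j} = A_{n,j}^J \cap A_n^J$; note that $A^{\prime J}_{n,j}$ might not be convex even though $A_{n,j}^J$
is constrained to be so. Similarly, we can define $L_k^{n,J,j}$ as in (5) with $A_{n,j}$ replaced by
$A^{\prime J}_{n,j}$. It is easy to see that the following still holds:

$$E_0\left(L_{k+1}^{n,J,j}\right)^{\alpha} = E_0\left(L_k^{n,J,j}\right)^{\alpha} E_0\left(\frac{p_{kA^{\prime J}_{n,j}}}{p^*}\right)^{\alpha} \le E_0\left(L_k^{n,J,j}\right)^{\alpha} e^{-M_n^2J^2\epsilon_n^2/4},$$

even though $A^{\prime J}_{n,j}$ might be nonconvex, since $P_{kA^{\prime J}_{n,j}}$ is still contained in $A_{n,j}^J$, though
not necessarily in $A^{\prime J}_{n,j}$.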

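With $A_n^J$ playing the role of $A_n$ before, the same strategy as in the proof of Theorem
1 can be followed to show that

$$E_0\left(\Pi^n(A_n^J)\, 1_{\Omega_n}\right) \le \frac{\sum_j e^{-nM_n^2J^2\epsilon_n^2/4}\, \Pi(A^{\prime J}_{n,j})^{\alpha}}{\Pi(B(\epsilon_n, P^*; P_0))^{\alpha}\, e^{-\alpha n\epsilon_n^2(1+C)}} = e^{-nM_n^2J^2\epsilon_n^2/4 + \alpha n\epsilon_n^2(1+C)}\, \frac{\sum_j \Pi(A^{\prime J}_{n,j})^{\alpha}}{\Pi(B(\epsilon_n, P^*; P_0))^{\alpha}}. \qquad (7)$$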



We will use the notation $N_J$ to denote the $\alpha$-covering number for $A_n^J$. We are now
ready to prove the following:

Theorem 3 Assume that $P^*$ is the unique minimizer of the Kullback-Leibler divergence
to the true distribution with $E_0(\log(p_0/p^*)) < \infty$. Let $\epsilon_n$ be a sequence such that
$\epsilon_n \to 0$ and $n\epsilon_n^2$ is bounded away from zero, and let $A_n^J$, $\{A_{n,j}^J\}_{j=1}^{N_J}$ be defined as above. If the
following conditions hold:

1) $N_J \le e^{n\epsilon_n^2}$ for all $J \ge 1$,

2) (6) is satisfied,

then we have

$$\Pi^n(P : d(P, P^*) > M_n\epsilon_n) \to 0$$

in probability for any sequence $M_n \to \infty$.
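Proof. We start by writing

$$E_0(\Pi^n(A_n)) \le \sum_{J=1}^{\infty} E_0(\Pi^n(A_n^J)),$$

where $A_n$ now denotes $\{P : d(P, P^*) > M_n\epsilon_n\}$. We can bound the sum over $j$ for each fixed $J$ as

$$\sum_j \Pi(A^{\prime J}_{n,j})^{\alpha} \le N_J\, \Pi(A_n^J)^{\alpha} \le e^{n\epsilon_n^2}\, \Pi(A_n^J)^{\alpha},$$

since $A^{\prime J}_{n,j} \subset A_n^J$ and using condition 1). Plugging this into (7) and using condition
2), we get

$$E_0(\Pi^n(A_n)) \le \sum_{J \ge 1} e^{-n\epsilon_n^2M_n^2J^2/4 + \alpha n\epsilon_n^2(1+C) + n\epsilon_n^2 + \alpha n\epsilon_n^2M_n^2J^2/8} + P_0(\Omega_n^c).$$

By Lemma 7.1 of Kleijn & van der Vaart (2006), $P_0(\Omega_n^c)$ can be made arbitrarily
small by choosing $C$ sufficiently large, under the condition that $n\epsilon_n^2$ is bounded away
from zero. For any $C$, the sum above converges to zero since $M_n \to \infty$. □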
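
The convergence of the sum over $J$ in the last step can be spelled out. The following is a sketch of the exponent arithmetic, assuming $0 < \alpha < 1$, $C > 0$, $n\epsilon_n^2 \ge c > 0$ for all $n$ (the constant $c$ is introduced here only for illustration), and $n$ large enough that $M_n^2 \ge 16(2 + C)$.

```latex
% Exponent of each summand, divided by n\epsilon_n^2:
\begin{align*}
-\frac{M_n^2 J^2}{4} + \alpha(1+C) + 1 + \frac{\alpha M_n^2 J^2}{8}
  &= -\frac{(2-\alpha) M_n^2 J^2}{8} + \alpha(1+C) + 1 \\
  &\le -\frac{M_n^2 J^2}{8} + (2 + C)
     && \text{since } 0 < \alpha < 1 \\
  &\le -\frac{M_n^2 J^2}{16}
     && \text{since } M_n^2 J^2 \ge 16(2+C) \text{ for } J \ge 1 .
\end{align*}
% Summing over J with n\epsilon_n^2 \ge c and J^2 \ge J:
\[
\sum_{J \ge 1} e^{-n\epsilon_n^2 M_n^2 J^2/16}
  \le \sum_{J \ge 1} e^{-c M_n^2 J/16}
  = \frac{e^{-c M_n^2/16}}{1 - e^{-c M_n^2/16}}
  \longrightarrow 0
  \quad \text{as } M_n \to \infty .
\]
```

The geometric bound makes explicit why $n\epsilon_n^2$ only needs to be bounded away from zero here, rather than diverging.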



4 DISCUSSION 

We demonstrated that rates of convergence of the posterior distribution under misspecification
can be established without the construction of a sequence of tests. Theorem 3, which we
derived above, is slightly weaker than Theorem 2.2 in Kleijn & van der Vaart (2006)
due to our use of assumption (2), which is stronger than (3). That said, we are not
aware of any examples where the weaker condition (3) provides any advantage over
(2). In Walker (2007), the authors demonstrated that using the martingale approach
can improve on the rates slightly for some problems. Theorem 3 shows that the results
of Kleijn & van der Vaart (2006) are implied by our result. This is precisely true
for well-specified problems, while for misspecified problems it is not conclusive, for
the reason stated above. Unfortunately, we have not been able to construct an
example in which this approach provides a faster rate.

The extension to the case where the prior $\Pi$ depends on $n$, and to the case where there
exists a finite number of points at minimal Kullback-Leibler divergence to the true
distribution, should be straightforward.